
Poster: garthus Date: Nov 29, 2009 8:24am
Forum: texts Subject: Re: any proofreading of the texts you include in your collections?


This is only OCR'd text. Some books have been through Project Gutenberg:

They have good full-text files. Eventually all of this will be coordinated through Open Library. Since the work is done by volunteers, it can only get done at the rate at which they can work. In any case, the prime directive is to get the works scanned and archived; all of this can and will come later.



Poster: stbalbach Date: Nov 29, 2009 11:40am
Forum: texts Subject: Re: any proofreading of the texts you include in your collections?

See also "Distributed Proofreaders"

They are always looking for volunteers to help.

Also this announcement from IA: it's not entirely clear what this is about, but it does involve "improved OCR accuracy". We may get to the point where machine OCR is good enough, without need for proofreaders, at least for general reading purposes.



Poster: garthus Date: Nov 29, 2009 1:40pm
Forum: texts Subject: Re: any proofreading of the texts you include in your collections?


I forgot to mention them. I have over 50 books passing through Distributed Proofreaders at this time. Readers should go to that site and spend as much free time as possible proofreading, since ultimately it will all go into the Open Library.



Poster: Time Traveller Date: Nov 29, 2009 5:53pm
Forum: texts Subject: Re: any proofreading of the texts you include in your collections?

While software is great with OCR, other people use great software to create new type fonts. It's a continuing war of catch-up, with OCR software writers chasing people who want to be unique and creative with new fonts.

With advertising material, the creators go to extremes with type fonts, yet such material comes in limited quantities, preventing OCR software from learning and adapting.

It goes without saying, on the IA, that only the original scans can be trusted for accuracy.

If the truth is acknowledged, not even human proofreaders are 100% accurate. If they go for nothing but 100% accuracy, then far fewer texts are going to be proofread.

I have to admit I am guilty, twice now, of uploading a PDF only to make use of the IA's OCR software when I was using a borrowed PC without any OCR application installed. (I set the IA page to auto-delete in 30 days, but before that, I deleted all the files once I had my plain text.)

I find consumer quality OCR software extremely costly, while the Archive must use even better commercial grade OCR software.

When the IA states that a format is editable text, it only means that once the text is downloaded to your PC, it is editable for your own uses. Mostly that means you can copy and paste sections of text into your own documents, such as a school report, a project report for your senior manager, or even a book being written by a historian.

But then it is expected, and common sense, that you proofread the material you have copied.

Therefore, I expect 99.99% of people only read the original book scan in PDF format, and the IA just runs OCR over PDFs as a helpful convenience for people needing to quote an item from a text exactly.

I, for one, would much rather proofread an item of copied and pasted text than retype it all.

If just anybody can edit IA OCR-derived text material on the Archive, you can be sure more errors will creep in, and you also give vandals, such as Holocaust deniers, an opening.

Look at the trouble that Wikipedia sometimes has when well-known people try to change their personal history.

Sometimes, too, people who believe they are very clued up see a perceived error in a book and, without research, correct it. I see this often in non-fiction books I borrow from the local library. One bad case was a book about Apollo 13 having facts changed by a previous borrower to reflect the Apollo 13 movie, whose writers had taken liberties with historic facts to create a profitable story line.

But even if a book on the IA contains errors by the author, it should never be corrected because, for example, a historian 200 years from now (should this world survive) could have a eureka moment: "That is where Thomas Edison got it wrong. His reference book had the error; it was not him after all."

Also, errors might indicate an author being unsure, and then again it might not be an error at all, just the author being hundreds of years ahead of the thinking of any of his peers.

There are differences in spelling worldwide, colour and color for example. A historian could use such spelling differences to point at a passage in a book, suggesting that a chapter credited to one author might very well have been written by somebody else.

Take place names. In NZ there has been recent argument about how the name of a certain small city is spelt, with an H or not, its region having the same name but spelt differently. Researchers went into text archives dating back to the 1840s and came up with an answer, yet it's still being argued.

That is why it's important not to allow apparent spelling errors to be corrected on the IA.

Due to the poor quality of some very old books being scanned for the IA, it's 100% sure that OCR errors can and do happen.

So on the IA, only the scanned-text PDF can be relied on. The fact that the IA runs OCR over uploaded PDFs should not be taken to mean that the IA says the resulting text is 100% correct. The IA only uses OCR as a convenience for its users, NOT as part of the historic archive records.

Does the IA have an OCR disclaimer on its website? Maybe a link should be automatically added to every text description page?

I agree with Gerry. With the world moving towards digital records for current records, news, and information, the cost of space for storing physical books becoming astronomical, and a new generation of kids being brought up on e-readers, there is extreme urgency to digitise as much of history as we can save. We IA volunteers are saving the lesser items for the future, while official agencies are prioritising what they save and burning the rest.

Look back at the archaeologists digging up old ruins to discover the most basic things people did back then; the remaining official records were just that, official.

Today, there is a record amount of time-limited print material being produced (the last major fling of the paper producers). Such things as a chain store's Christmas catalogue would be just as valuable to historians in 100 years' time as the Canadian department store's seasonal catalogues, which show fashions, for example, are to today's historians. (They are free to download from the IA Canadian Librarians catalogue.)

Such current material, as well as old material, is going to be lost forever if we wait for the proofreaders to catch up.

And I am beginning to believe, knowing that what's on the IA already is a mere fraction of what should be there, that it might be thousands of years before human proofreaders catch up with what's already digitised today.

The IA should be seen only as a depository for old texts, to prevent such texts being lost forever, while other groups of volunteers slowly plod along, flat out, proofreading the OCRed text.

Question to ponder: I just viewed an IA-sourced newsreel from the late 1940s showing a copy of the Gutenberg Bible. The narrator said Gutenberg invented the printing press. Not true: he was the European inventor of reusable type (maybe not the first European to think of the idea, but the first to work out a way to make the type so it was all of equal height, so all letters on a page were of equal impression).

Would today's OCR do a good job on the original Gutenberg-printed Bible?

What about the very standardised characters of books copied by monks, before any form of machine copying? (a page of text, hand carved on one printing plate.)

As for recent text knowledge being lost, I think it was last year that NASA engineers went to a museum to dismantle one of the remaining unused, mint-condition Apollo Command/Service Modules to RE-learn how the wiring and life support lines ran between the two modules and got disconnected just prior to the non-postponable re-entry on the way back from the Moon.

The last Apollos went up in the early 1970s, and NASA no longer had any blueprints or design documents left to look back on when they needed ideas for the new Orion man-carrying capsule.

After all, NASA never thought it would ever have anything but Shuttle-type vehicles after Apollo. It even destroyed the jigs and patterns for the Saturn launchers, to prevent Congress from cancelling the Shuttle Program due to billion-dollar cost overruns before Columbia's maiden voyage.

Jigs and patterns can be recreated from blueprints and documentation.

So, even in the '80s, valuable documents were being destroyed by government agencies; certainly non-government agencies are doing the same.

Forget about OCR making obvious mistakes, because that can be fixed in the future. The big urgency is to digitise the original documents.

AND let's hope we can still read digital in the future. In the '80s, the BBC produced what is called the "Domesday Project", a set of large-diameter laser-read optical disks with a snapshot of everything in the UK: things like folk dances in video, newspapers scanned, sound bites, etc.

Last I heard, plenty of these disks are still around, but no disk reader!


This post was modified by Time Traveller on 2009-11-30 01:53:35